5 research outputs found

    Machine Learning Techniques for Topic Detection and Authorship Attribution in Textual Data

    The unprecedented expansion of user-generated content in recent years calls for better information filtering to extract high-quality information from the huge amount of available data. In this dissertation, we begin with topic detection from microblog streams, which is the first step toward monitoring and summarizing social data. We then shift our focus to authorship attribution, a sub-area of computational stylometry. Determining the style of a document is orthogonal to determining its topic, since the document features that capture style are mainly independent of topic. We first present a frequent pattern mining approach for topic detection from microblog streams. This approach uses a Maximal Sequence Mining (MSM) algorithm to extract pattern sequences, where each pattern sequence is an ordered set of terms. We then construct a pattern graph, a directed-graph representation of the mined sequences, and apply a community detection algorithm to group the mined patterns into topic clusters. Experiments on Twitter datasets demonstrate that the MSM approach achieves high performance in comparison with state-of-the-art methods. For authorship attribution, previously proposed neural models are mainly lexical and lack multi-level modeling of writing style; we present a syntactic recurrent neural network that encodes the syntactic patterns of a document in a hierarchical structure. The proposed model learns the syntactic representation of sentences from the sequence of part-of-speech tags. Furthermore, we present a style-aware neural model that encodes document information at three stylistic levels (lexical, syntactic, and structural) and evaluate it on authorship attribution. Our experimental results on four authorship attribution benchmark datasets reveal the benefits of encoding document information at all three stylistic levels compared with baseline methods in the literature. We extend this work with a transfer learning approach to measure the impact of lower-level versus higher-level linguistic representations on authorship attribution. Finally, we present a self-supervised framework for learning structural representations of sentences. The self-supervised network is a Siamese network with two components: a lexical sub-network and a syntactic sub-network, which take the sequence of words and their corresponding structural labels as input, respectively. The model is trained with a contrastive loss objective. As a result, each word in the sentence is embedded into a vector representation that mainly carries structural information. The learned structural representations can be concatenated with existing pre-trained word embeddings to create style-aware embeddings that carry both semantic and syntactic information and are well suited to authorship attribution.
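
    The abstract does not say which contrastive loss the Siamese network uses, so the following is only a rough sketch under an assumption: it uses the classic margin-based pairwise form, where matching pairs are pulled together and non-matching pairs are pushed at least `margin` apart. The function names, the toy vectors, and the `margin` parameter are all hypothetical. The second helper illustrates the concatenation step that yields a style-aware embedding.

```python
import math

def contrastive_loss(u, v, is_match, margin=1.0):
    """Margin-based pairwise contrastive loss (an assumed form, not the authors').

    u, v     -- embeddings from the two sub-networks (plain lists of floats)
    is_match -- True if the pair should be close, False if it should be apart
    """
    d = math.dist(u, v)  # Euclidean distance (Python 3.8+)
    if is_match:
        return d ** 2                      # pull matching pairs together
    return max(0.0, margin - d) ** 2       # push non-matching pairs past the margin

def style_aware_embedding(word_vec, structural_vec):
    """Concatenate a pre-trained word vector with a learned structural vector."""
    return list(word_vec) + list(structural_vec)

# Toy vectors, chosen only for illustration.
lexical_out = [0.0, 0.0]
syntactic_out = [3.0, 4.0]   # distance 5 from lexical_out
loss_match = contrastive_loss(lexical_out, syntactic_out, is_match=True)
loss_far = contrastive_loss(lexical_out, syntactic_out, is_match=False, margin=1.0)
loss_near = contrastive_loss(lexical_out, syntactic_out, is_match=False, margin=6.0)
combined = style_aware_embedding([0.1, 0.2, 0.3], [0.9, 0.8])
```

    In this form a non-matching pair contributes nothing once it is already farther apart than the margin, which is why `loss_far` is zero while `loss_near` is not.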

    Maximal Sequence Mining Approach For Topic Detection From Microblog Streams

    The unprecedented expansion of user-generated content in recent years calls for better information filtering to extract high-quality information from the huge amount of available data. In particular, topic detection from microblog streams is the first step toward monitoring and summarizing social data. This task is challenging because microblog content is short and noisy. Moreover, the underlying models need to handle heterogeneous streams that contain multiple stories evolving simultaneously. In this work, we introduce a frequent pattern mining approach for topic detection from a microblog stream. This approach first uses a Maximal Sequence Mining (MSM) algorithm to extract pattern sequences, each an ordered set of terms. This scheme can capture more semantic information than unordered sets of the same terms. A pattern graph, which is a directed-graph representation of the mined sequences, is then constructed. Subsequently, a community detection algorithm is applied to the pattern graph to group the mined patterns into topic clusters. Experiments on Twitter datasets demonstrate that the MSM approach achieves high performance in comparison with state-of-the-art methods.
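
    The graph-then-cluster step described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' method: the abstract does not specify how nodes and edges are defined or which community detection algorithm is used. Here nodes are assumed to be terms, an edge links each term to the one that directly follows it in a mined sequence, and weakly connected components stand in for a real community detection algorithm. The function names and the toy sequences are hypothetical, and the sequence mining step itself is not shown.

```python
from collections import defaultdict

def build_pattern_graph(sequences):
    """Directed graph where edge a -> b counts how often b directly follows a."""
    graph = defaultdict(lambda: defaultdict(int))
    for seq in sequences:
        for term in seq:
            graph[term]  # touch the key so single-term sequences still add a node
        for a, b in zip(seq, seq[1:]):
            graph[a][b] += 1
    return {node: dict(edges) for node, edges in graph.items()}

def weakly_connected_clusters(graph):
    """Group nodes by weak connectivity (a simple stand-in for community detection)."""
    undirected = defaultdict(set)
    for a, edges in graph.items():
        undirected[a]  # keep isolated nodes
        for b in edges:
            undirected[a].add(b)
            undirected[b].add(a)
    seen, clusters = set(), []
    for start in undirected:
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            component.add(node)
            stack.extend(undirected[node] - seen)
        clusters.append(component)
    return clusters

# Toy "mined" sequences, invented for illustration only.
mined = [
    ["earthquake", "hits", "japan"],
    ["japan", "tsunami", "warning"],
    ["oscars", "best", "picture"],
]
pattern_graph = build_pattern_graph(mined)
topics = weakly_connected_clusters(pattern_graph)
```

    With these toy sequences the two earthquake-related patterns share the term "japan" and fall into one cluster, while the third pattern forms its own. A modularity-based or similar community detection method would replace `weakly_connected_clusters` in any realistic setting, since connected components cannot split a densely linked graph into finer topics.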

    Fault-Tolerant Network-Server Architecture For Time-Critical Web Applications

    Alongside the recent evolution of web content, technologies such as WebSockets and WebRTC allow the development of real-time web applications, something that was not possible on RESTful HTTP platforms. Future IoT applications will also require real-time data retrieval as information travels through and is processed by web servers. For these reasons, we envision a demand for real-time web applications in which a service interruption imposes a high cost on service providers. A time-critical real-time application is assumed to be able to handle infrastructure failures not with a service interruption but with minor performance degradation. In this study, we present a fully redundant system architecture for web server deployment, covering both the network and the server, with fast failover capabilities. Unlike interruptions that require applications to be re-initialized, the proposed system provides failover speeds for every component such that even fast real-time applications can continue to operate. Experiments with application scenarios show superior failover speeds under several test cases and reveal limitations in the latest technologies in supporting such applications.